The purpose of this project is to build a model that predicts whether a person has heart disease based on some key health indicators.
“According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), not getting enough physical activity or drinking too much alcohol.” (from the Kaggle dataset page)
In order to give you a better understanding of heart disease, its harm and the key factors that may cause it, here are a few videos for you:
#install.packages('vembedr')
library(vembedr)
embed_url("https://www.youtube.com/watch?v=g131j2lb3xw")
embed_url("https://www.youtube.com/watch?v=u7k6sqTxOCU&t=139s")
Heart disease is a leading cause of death not only in the United States but around the world, and the number of deaths from heart disease rises every year. Therefore, we need a model that predicts a person's risk of heart disease from important indicators such as their BMI, whether they smoke, whether they drink alcohol, and so on. With such a model, we can remind people to pay attention to their health and lifestyle habits and help them avoid heart disease.
This project uses Kamil Pytlak’s dataset from Kaggle.
According to Kamil Pytlak, the dataset originally comes from the CDC's 2020 annual survey of 400k adults about their health status. He cleaned the original CDC survey data and selected the most relevant variables from it to support machine learning projects related to heart disease.
This dataset contains 319795 observations of 18 variables (9 booleans, 5 strings and 4 decimals). HeartDisease is the response variable; the other 17 variables are our predictors. A full copy of the codebook of the unprocessed dataset is available in my zipped files, but to help us better understand the variables, I also show the most important parts of the codebook here.
HeartDisease: Whether the respondent has ever been diagnosed with heart disease
BMI: The body mass index of the respondent
Smoking: Whether the respondent has smoked at least 100 cigarettes in his/her entire life [Note: 5 packs = 100 cigarettes]
AlcoholDrinking: Whether the respondent is a heavy drinker [Note: a heavy drinker is an adult man who has more than 14 drinks per week or an adult woman who has more than 7 drinks per week]
Stroke: Whether the respondent ever had a stroke
PhysicalHealth: The number of days the respondent had poor physical health in the past 30 days [Note: includes physical illness and injury]
MentalHealth: The number of days the respondent had poor mental health in the past 30 days
DiffWalking: Whether the respondent has serious difficulty walking or climbing stairs
Sex: Whether the respondent is male or female
AgeCategory: The respondent's age group [Note: one of '18-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80 or older']
Race: The race of the respondent [Note: one of 'American Indian/Alaskan Native', 'Asian', 'Black', 'Hispanic', 'White', 'Other']
Diabetes: Whether the respondent had diabetes [Note: one of 'No', 'No, borderline diabetes', 'Yes', 'Yes (during pregnancy)']
PhysicalActivity: Whether the respondent did physical activity or exercise during the past 30 days other than their regular job
GenHealth: The respondent's assessment of his/her health in general [Note: one of 'Excellent', 'Very good', 'Good', 'Fair', 'Poor']
SleepTime: The number of hours of sleep the respondent gets in a 24-hour period
Asthma: Whether the respondent had asthma
KidneyDisease: Whether the respondent had kidney disease [Note: not including kidney stones, bladder infection or incontinence]
SkinCancer: Whether the respondent had skin cancer
# install.packages('caret')
# install.packages("ROSE")
# load packages
library(tidyverse)
library(tidymodels)
library(ISLR)
library(ISLR2)
library(discrim)
library(corrr)
library(rpart.plot)
library(vip)
library(janitor)
library(randomForest)
library(xgboost)
library(corrplot)
library(glmnet)
library(ranger)
library(caret)
library(klaR)
library(dplyr)
tidymodels_prefer()
library(ROSE)
library(kknn)
set.seed(1234)
# read the dataset
records <- read.csv("heart_2020_cleaned.csv")
head(records) # show the first few rows of the dataset
## HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
## 1 No 16.60 Yes No No 3 30
## 2 No 20.34 No No Yes 0 0
## 3 No 26.58 Yes No No 20 30
## 4 No 24.21 No No No 0 0
## 5 No 23.71 No No No 28 0
## 6 Yes 28.87 Yes No No 6 0
## DiffWalking Sex AgeCategory Race Diabetic PhysicalActivity GenHealth
## 1 No Female 55-59 White Yes Yes Very good
## 2 No Female 80 or older White No Yes Very good
## 3 No Male 65-69 White Yes Yes Fair
## 4 No Female 75-79 White No No Good
## 5 Yes Female 40-44 White No Yes Very good
## 6 Yes Female 75-79 Black No No Fair
## SleepTime Asthma KidneyDisease SkinCancer
## 1 5 Yes No Yes
## 2 7 No No No
## 3 8 Yes No No
## 4 6 No No Yes
## 5 8 No No No
## 6 12 No No No
While the data set that was downloaded was tidy, a few different cleaning steps were necessary before the split occurred:
records <-records %>% clean_names()
head(records)
## heart_disease bmi smoking alcohol_drinking stroke physical_health
## 1 No 16.60 Yes No No 3
## 2 No 20.34 No No Yes 0
## 3 No 26.58 Yes No No 20
## 4 No 24.21 No No No 0
## 5 No 23.71 No No No 28
## 6 Yes 28.87 Yes No No 6
## mental_health diff_walking sex age_category race diabetic
## 1 30 No Female 55-59 White Yes
## 2 0 No Female 80 or older White No
## 3 30 No Male 65-69 White Yes
## 4 0 No Female 75-79 White No
## 5 0 Yes Female 40-44 White No
## 6 0 Yes Female 75-79 Black No
## physical_activity gen_health sleep_time asthma kidney_disease skin_cancer
## 1 Yes Very good 5 Yes No Yes
## 2 Yes Very good 7 No No No
## 3 Yes Fair 8 Yes No No
## 4 No Good 6 No No Yes
## 5 Yes Very good 8 No No No
## 6 No Fair 12 No No No
table(records$heart_disease)
##
## No Yes
## 292422 27373
We find that our response variable is highly imbalanced: there are many more 'No' observations than 'Yes' observations. We need to address this imbalance, otherwise it will seriously bias our predictions. Following the TA's advice, we use the ovun.sample() function to undersample the majority class.
set.seed(1234)
newrecords<- ovun.sample(heart_disease~.,data = records,
p=0.5,seed = 1,method = "under")$data
table(newrecords$heart_disease)
##
## No Yes
## 27387 27373
After undersampling, the two levels are nearly balanced, although 'No' still has slightly more observations.
sum(is.na(newrecords))
## [1] 0
# It means there is no missing value in our data.
# We continue the process of data cleaning.
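To double-check, the single total above can also be broken down column by column (a quick sketch using base R):

```r
# count missing values in each column; every entry should be 0
colSums(is.na(newrecords))
```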
Convert heart_disease, smoking, alcohol_drinking, stroke, diff_walking, sex, age_category, race, diabetic, physical_activity, gen_health, asthma, kidney_disease, and skin_cancer to factors.
newrecords <- newrecords %>%
mutate(
heart_disease = factor(heart_disease, levels = c('Yes', 'No')),
smoking = factor(smoking, levels = c('Yes', 'No')),
alcohol_drinking = factor(alcohol_drinking, levels = c('Yes', 'No')),
stroke = factor(stroke, levels = c('Yes', 'No')),
diff_walking = factor(diff_walking, levels = c('Yes', 'No')),
sex = factor(sex),
age_category = factor(age_category),
race = factor(race, levels = c("American Indian/Alaskan Native", "Asian", "Black", "Hispanic", "White", "Other")),
diabetic = factor(diabetic, levels = c("No", "No, borderline diabetes", "Yes", "Yes (during pregnancy)")),
physical_activity = factor(physical_activity),
gen_health = factor(gen_health, levels = c("Excellent","Very good", "Good", "Fair", "Poor" )),
asthma = factor(asthma, levels = c('Yes', 'No')),
kidney_disease = factor(kidney_disease, levels = c('Yes', 'No')),
skin_cancer = factor(skin_cancer, levels = c('Yes', 'No'))
)
head(newrecords)
## heart_disease bmi smoking alcohol_drinking stroke physical_health
## 1 No 33.84 No No No 0
## 2 No 31.75 No Yes No 0
## 3 No 33.64 No No No 0
## 4 No 24.56 No No No 0
## 5 No 40.69 Yes No No 30
## 6 No 27.89 No No No 0
## mental_health diff_walking sex age_category race diabetic
## 1 2 No Female 45-49 White No
## 2 0 No Male 55-59 White No
## 3 28 No Male 40-44 Black Yes
## 4 0 No Female 40-44 Asian No
## 5 0 Yes Male 60-64 White Yes
## 6 0 No Male 75-79 White No
## physical_activity gen_health sleep_time asthma kidney_disease skin_cancer
## 1 Yes Very good 6 No No Yes
## 2 Yes Excellent 7 No No Yes
## 3 Yes Good 7 No No No
## 4 Yes Excellent 6 No No No
## 5 Yes Fair 7 No No No
## 6 Yes Very good 5 No No Yes
# show me how many observations in the new dataset
# show me how many variables in the new dataset
dim(newrecords)
## [1] 54760 18
This entire exploratory data analysis is based on the undersampled set, which has 54760 observations of 18 variables. Each observation represents a single survey respondent.
First, draw a bar chart of the response variable heart_disease.
newrecords %>%
ggplot(aes(x = heart_disease)) +
geom_bar()
Now, let’s analyze variable bmi.
First, draw a histogram of bmi.
hist(newrecords$bmi, main = "Histogram of BMI", xlab = 'The value of BMI', ylab = 'The number of respondents')
* The distribution of bmi appears to be right skewed, with a long right tail, though it is close to a normal distribution. There's one peak around 25-30, and most respondents have a BMI below 40.
Second, draw a boxplot of bmi by heart_disease.
newrecords %>%
ggplot(aes(x = bmi, y=heart_disease))+
geom_boxplot() +
xlab("The BMI of the respondent")
* Based on the graph, respondents with higher BMI appear more likely to have heart disease. (We will confirm this result at the end of the project.)
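As a quick numeric check of this visual impression, we can compare BMI summaries across the two response levels (a sketch using the dplyr verbs already loaded; the summary column names are my own):

```r
# compare BMI summaries within each heart_disease level
newrecords %>%
  group_by(heart_disease) %>%
  summarize(median_bmi = median(bmi),
            mean_bmi = mean(bmi))
```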
Next, analyze the variable smoking. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = smoking)) +
geom_bar()
* According to the graph, there are many more 'No' observations than 'Yes' observations for the variable smoking.
Second, draw a plot of smoking by heart_disease.
newrecords %>%
ggplot(aes(x= smoking, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
Next, analyze the variable alcohol_drinking. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = alcohol_drinking)) +
geom_bar()
According to the graph, there are many more 'No' observations than 'Yes' observations for the variable alcohol_drinking. For this reason, it may be hard for us to see the relationship between alcohol_drinking and heart_disease.
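Because the 'Yes' bar is so small, raw counts are hard to compare; conditional proportions make the relationship easier to read. A minimal sketch:

```r
# proportion of each heart_disease level within each alcohol_drinking level
newrecords %>%
  count(alcohol_drinking, heart_disease) %>%
  group_by(alcohol_drinking) %>%
  mutate(prop = n / sum(n))
```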
Second, draw a plot of variable alcohol_drinking by
heart_disease
newrecords %>%
ggplot(aes(x= alcohol_drinking, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
Next, analyze the variable stroke. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = stroke)) +
geom_bar()
According to the graph, there are many more 'No' observations than 'Yes' observations for the variable stroke. For this reason, it may be hard for us to see the relationship between stroke and heart_disease.
Second, draw a plot of variable stroke by
heart_disease
newrecords %>%
ggplot(aes(x= stroke, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
Next, analyze physical_health: the number of days the respondent had poor physical health in the past 30 days (including physical illness and injury). First, draw a histogram.
hist(newrecords$physical_health, main = "Histogram of physical_health", xlab = 'The number of days that the respondent had poor physical health in the past 30 days', ylab = 'The number of respondents')
According to the graph, most respondents had fewer than 10 poor physical health days in the past 30 days, while some respondents reported all 30 days.
Second, draw a boxplot of variable physical_health
by heart_disease
newrecords %>%
ggplot(aes(x =physical_health , y=heart_disease))+
geom_boxplot() +
xlab("The number of days that the respondent had poor physical health in the past 30 days")
Next, analyze mental_health: the number of days the respondent had poor mental health in the past 30 days. First, draw a histogram.
hist(newrecords$mental_health, main = "Histogram of mental_health", xlab = 'The number of days that the respondent had poor mental health in the past 30 days', ylab = 'The number of respondents')
According to the graph, most respondents had fewer than 10 poor mental health days in the past 30 days, while some respondents reported all 30 days.
Second, draw a boxplot of variable mental_health by
heart_disease
newrecords %>%
ggplot(aes(x =mental_health , y=heart_disease))+
geom_boxplot() +
xlab("The number of days that the respondent had poor mental health in the past 30 days")
Next, analyze the variable diff_walking. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = diff_walking)) +
geom_bar()
According to the graph, there are many more 'No' observations than 'Yes' observations for the variable diff_walking. For this reason, it may be hard for us to see the relationship between diff_walking and heart_disease.
Second, draw a plot of variable diff_walking by
heart_disease
newrecords %>%
ggplot(aes(x= diff_walking, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
Next, analyze the variable sex. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = sex)) +
geom_bar()
According to the graph, there are more 'Male' observations than 'Female' observations for the variable sex. Since the difference between the two levels is not very large, it should not cause problems.
Second, draw a plot of variable sex by
heart_disease
newrecords %>%
ggplot(aes(x= sex, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
Next, analyze the variable age_category. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = age_category)) +
geom_bar()
According to the graph, there are more older respondents (50 and older) than younger respondents (under 50).
Second, draw a plot of variable age_category by
heart_disease
newrecords %>%
ggplot(aes(x= age_category, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
Next, analyze the variable race. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = race)) +
geom_bar()
According to the graph, most of the respondents are white. For this reason, it may be hard for us to see the relationship between race and heart_disease.
Second, draw a plot of variable race by
heart_disease
newrecords %>%
ggplot(aes(x= race, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
Next, analyze the variable diabetic. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = diabetic)) +
geom_bar()
According to the graph, most respondents are on level 'No' and only some are on level 'Yes'. For this reason, it may be hard for us to see the relationship between diabetic and heart_disease.
Second, draw a plot of variable diabetic by
heart_disease
newrecords %>%
ggplot(aes(x= diabetic, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
head(newrecords)
## heart_disease bmi smoking alcohol_drinking stroke physical_health
## 1 No 33.84 No No No 0
## 2 No 31.75 No Yes No 0
## 3 No 33.64 No No No 0
## 4 No 24.56 No No No 0
## 5 No 40.69 Yes No No 30
## 6 No 27.89 No No No 0
## mental_health diff_walking sex age_category race diabetic
## 1 2 No Female 45-49 White No
## 2 0 No Male 55-59 White No
## 3 28 No Male 40-44 Black Yes
## 4 0 No Female 40-44 Asian No
## 5 0 Yes Male 60-64 White Yes
## 6 0 No Male 75-79 White No
## physical_activity gen_health sleep_time asthma kidney_disease skin_cancer
## 1 Yes Very good 6 No No Yes
## 2 Yes Excellent 7 No No Yes
## 3 Yes Good 7 No No No
## 4 Yes Excellent 6 No No No
## 5 Yes Fair 7 No No No
## 6 Yes Very good 5 No No Yes
Next, analyze physical_activity: whether the respondent did physical activity or exercise during the past 30 days other than their regular job. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = physical_activity)) +
geom_bar()
According to the graph, most respondents are on level 'Yes' and only some are on level 'No'. For this reason, it may be hard for us to see the relationship between physical_activity and heart_disease.
Second, draw a plot of variable physical_activity by
heart_disease
newrecords %>%
ggplot(aes(x= physical_activity, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
Next, analyze gen_health: the respondent's assessment of his/her health in general ('Excellent', 'Very good', 'Good', 'Fair', or 'Poor'). First, draw a bar chart.
newrecords %>%
ggplot(aes(x = gen_health)) +
geom_bar()
According to the plot, most respondents rate their health as good or better, while some rate it as fair or poor.
Second, draw a plot of variable gen_health by
heart_disease
newrecords %>%
ggplot(aes(x= gen_health, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
Next, analyze sleep_time. First, draw a histogram.
hist(newrecords$sleep_time, main = "Histogram of sleep time", xlab = 'The number of hours of sleep of the respondent in a 24-hour period', ylab = 'The number of respondents')
According to the graph, the distribution of sleep_time appears to be right skewed, with a long right tail, though it is close to a normal distribution. There's one peak around 7-8 hours, and most people sleep between 5 and 8 hours.
Second, draw a boxplot of variable sleep_time by
heart_disease
newrecords %>%
ggplot(aes(x = sleep_time, y=heart_disease))+
geom_boxplot() +
xlab("The number of hours of sleep of the respondent in a 24-hour period")
Next, analyze the variable asthma. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = asthma)) +
geom_bar()
According to the graph, there are many more 'No' observations than 'Yes' observations for the variable asthma. For this reason, it may be hard for us to see the relationship between asthma and heart_disease.
Second, draw a plot of variable asthma by
heart_disease
newrecords %>%
ggplot(aes(x= asthma, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
* Based on the graph, respondents who had asthma appear more likely to have heart disease. (We will confirm this result at the end of the project.)
Next, analyze the variable kidney_disease. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = kidney_disease)) +
geom_bar()
According to the graph, there are many more 'No' observations than 'Yes' observations for the variable kidney_disease. For this reason, it may be hard for us to see the relationship between kidney_disease and heart_disease.
Second, draw a plot of variable kidney_disease by
heart_disease
newrecords %>%
ggplot(aes(x= kidney_disease, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
* Based on the graph, respondents who had kidney disease appear more likely to have heart disease. (We will confirm this result at the end of the project.)
Next, analyze the variable skin_cancer. First, draw a bar chart.
newrecords %>%
ggplot(aes(x = skin_cancer)) +
geom_bar()
According to the graph, there are many more 'No' observations than 'Yes' observations for the variable skin_cancer. For this reason, it may be hard for us to see the relationship between skin_cancer and heart_disease.
Second, draw a plot of variable skin_cancer by
heart_disease
newrecords %>%
ggplot(aes(x= skin_cancer, y= heart_disease , fill= heart_disease)) +
geom_bar(stat="identity")+theme_minimal()
This completes the exploratory data analysis.
The data was split into 80% training and 20% testing. Stratified sampling on heart_disease was used so that both sets preserve the class ratio (see the EDA).
The data split was conducted prior to the EDA as I did not want to know anything about my testing data set before I tested my model on those observations.
set.seed(1234)
newrecords_split <- newrecords %>%
initial_split(prop = 0.8, strata = "heart_disease")
newrecords_train <- training(newrecords_split)
newrecords_test <- testing(newrecords_split)
# check whether the training and testing data sets
# have the appropriate number of observations.
dim(newrecords)
## [1] 54760 18
dim(newrecords_train)
## [1] 43807 18
dim(newrecords_test)
## [1] 10953 18
In this part, I decided to use 4 different model classes (6 models in total).
Class 1: Logistic regression, LDA and QDA
Class 2: Boosted tree
Class 3: Random forest
Class 4: Nearest Neighbors
We set up a recipe on the training data and dummy-encode the nominal predictors: smoking, alcohol_drinking, stroke, diff_walking, sex, age_category, race, diabetic, physical_activity, gen_health, asthma, kidney_disease, and skin_cancer. (According to the lecture, we should encode all nominal predictors.)
recipe <- recipe(
  heart_disease ~ smoking + alcohol_drinking + stroke + physical_health + mental_health + diff_walking + sex + age_category + race + diabetic + physical_activity + gen_health + sleep_time + asthma + kidney_disease + skin_cancer, data = newrecords_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
set.seed(1234)
newrecords_folds <- vfold_cv(data = newrecords_train, v = 5, strata = heart_disease )
Since these three models belong to the same model class, we select the best of the three to represent this class. At the end, we will compare the best model from each of the four classes to choose our final model.
We specify a logistic regression model for classification using the "glm" engine, then create a workflow and add the model and the recipe we created earlier.
log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_wkflow <- workflow() %>%
add_model(log_reg) %>%
add_recipe(recipe)
In a similar process, but this time specify a linear discriminant analysis model for classification using the “MASS” engine.
lda_mod <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
lda_wkflow <- workflow() %>%
add_model(lda_mod) %>%
add_recipe(recipe)
In a similar process, but this time specify a quadratic discriminant analysis model for classification using the “MASS” engine.
qda_mod <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
qda_wkflow <- workflow() %>%
add_model(qda_mod) %>%
add_recipe(recipe)
control <- control_resamples(save_pred = TRUE)
log_fit <- fit_resamples(log_wkflow, newrecords_folds,control = control)
lda_fit <- fit_resamples(lda_wkflow, resamples = newrecords_folds,
                         control = control)
qda_fit <- fit_resamples(qda_wkflow, resamples = newrecords_folds,
control = control)
We use collect_metrics() to print the mean and standard error of the performance metrics (accuracy and roc_auc) across all folds for each of the three models, and then decide which fitted model performed best.
collect_metrics(log_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.627 5 0.00166 Preprocessor1_Model1
## 2 roc_auc binary 0.651 5 0.00191 Preprocessor1_Model1
collect_metrics(lda_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.627 5 0.00166 Preprocessor1_Model1
## 2 roc_auc binary 0.651 5 0.00191 Preprocessor1_Model1
collect_metrics(qda_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.567 5 0.000701 Preprocessor1_Model1
## 2 roc_auc binary 0.650 5 0.000842 Preprocessor1_Model1
Based on these results, the three models have very similar standard errors of accuracy, and the differences between them are very small. Since the mean accuracy of logistic regression equals that of LDA, and both are higher than QDA's, logistic regression and LDA are better than QDA. Logistic regression and LDA also have the same mean and standard error of roc_auc, meaning they perform equivalently, so we choose logistic regression as the best model of the three.
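To make this comparison easier to scan, the three sets of fold metrics can be gathered into one table; a sketch (the model column is added only for labeling):

```r
# stack the resampling metrics from the three fits into a single tibble
bind_rows(
  collect_metrics(log_fit) %>% mutate(model = "Logistic regression"),
  collect_metrics(lda_fit) %>% mutate(model = "LDA"),
  collect_metrics(qda_fit) %>% mutate(model = "QDA")
) %>%
  select(model, .metric, mean, std_err) %>%
  arrange(.metric, desc(mean))
```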
# fit the workflow on the training set; the test set is used only for evaluation below
log_test <- fit(log_wkflow, newrecords_train)
predict(log_test, new_data = newrecords_test, type = "class") %>%
bind_cols(newrecords_test %>% select(heart_disease)) %>%
accuracy(truth = heart_disease, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.622
# create a confusion matrix and visualize it
augment(log_test, new_data = newrecords_test) %>%
conf_mat(truth = heart_disease, estimate = .pred_class) %>%
autoplot(type = "heatmap")
# Plot roc_curve
augment(log_test, new_data = newrecords_test) %>%
roc_curve(heart_disease, .pred_Yes) %>%
autoplot()
# Calculate AUC
augment(log_test, new_data = newrecords_test) %>%
roc_auc(heart_disease, .pred_Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.646
rf_spec <- rand_forest(mtry = tune(),trees = tune(), min_n = tune()) %>%
set_engine("randomForest", importance = TRUE) %>%
set_mode("classification")
rf_wf <- workflow() %>%
add_model(rf_spec) %>%
add_recipe(recipe)
Since we have 17 predictors, the lecture tells us the range for mtry should be between 1 and 17.
Since our dataset is very large, to keep the run time manageable we set levels = 2 and cap the number of trees at 1000.
# set-up grid
param_grid_rf <- grid_regular(mtry(range = c(1, 17)),
trees(range = c(10, 1000)),
min_n(range = c(1, 10)),
levels = 2)
tune_res <- tune_grid(
rf_wf,
resamples = newrecords_folds,
grid = param_grid_rf,
metrics = metric_set(roc_auc)
)
# print result
autoplot(tune_res)
arrange(collect_metrics(tune_res), desc(mean))
## # A tibble: 8 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 1000 10 roc_auc binary 0.650 5 0.00175 Preprocessor1_Model7
## 2 1 1000 1 roc_auc binary 0.647 5 0.00474 Preprocessor1_Model3
## 3 1 10 10 roc_auc binary 0.641 5 0.00548 Preprocessor1_Model5
## 4 1 10 1 roc_auc binary 0.641 5 0.00345 Preprocessor1_Model1
## 5 17 10 1 roc_auc binary 0.627 5 0.00165 Preprocessor1_Model2
## 6 17 10 10 roc_auc binary 0.627 5 0.00165 Preprocessor1_Model6
## 7 17 1000 1 roc_auc binary 0.627 5 0.00167 Preprocessor1_Model4
## 8 17 1000 10 roc_auc binary 0.627 5 0.00167 Preprocessor1_Model8
We find that the mean roc_auc of the best-performing random forest model on the folds is 0.650, with mtry = 1, trees = 1000, and min_n = 10.
Use select_best() to choose the model that has the optimal roc_auc. Then use finalize_workflow(), fit(), and augment() to fit the model to the training set and evaluate its performance on the testing set.
best_complexity<- select_best(tune_res, metric= 'roc_auc')
class_forest_final <- finalize_workflow(rf_wf, best_complexity)
class_forest_final_fit <- fit(class_forest_final, data = newrecords_train)
augment(class_forest_final_fit, new_data = newrecords_test) %>%
accuracy(truth = heart_disease, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.622
augment(class_forest_final_fit, new_data = newrecords_test) %>%
conf_mat(truth = heart_disease , estimate = .pred_class) %>%
autoplot(type = "heatmap")
# Plot roc_curve
augment(class_forest_final_fit, new_data = newrecords_test) %>%
roc_curve(heart_disease, .pred_Yes) %>%
autoplot()
# Calculate AUC
augment(class_forest_final_fit, new_data = newrecords_test) %>%
roc_auc(heart_disease, .pred_Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.645
boost_tree_spec <- boost_tree(trees = tune(),
min_n = tune(),
mtry = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
boost_tree_workflow <- workflow() %>%
add_model(boost_tree_spec) %>%
add_recipe(recipe)
Since we have 17 predictors, the lecture tells us the range for mtry should be between 1 and 17.
Since our dataset is very large, to keep the run time manageable we set levels = 2 and cap the number of trees at 2000.
# set-up grid
boost_tree_grid<- grid_regular(mtry(range = c(1, 17)),
trees(range = c(10, 2000)),
min_n(range = c(1, 10)),
levels = 2)
boost_tune_res <- tune_grid(
boost_tree_workflow,
resamples = newrecords_folds,
grid = boost_tree_grid,
metrics = metric_set(roc_auc),
)
autoplot(boost_tune_res)
arrange(collect_metrics(boost_tune_res), desc(mean))
## # A tibble: 8 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 10 10 roc_auc binary 0.650 5 0.00138 Preprocessor1_Model3
## 2 1 10 1 roc_auc binary 0.650 5 0.00125 Preprocessor1_Model1
## 3 1 2000 10 roc_auc binary 0.650 5 0.00127 Preprocessor1_Model4
## 4 1 2000 1 roc_auc binary 0.650 5 0.00127 Preprocessor1_Model2
## 5 17 10 10 roc_auc binary 0.650 5 0.00127 Preprocessor1_Model7
## 6 17 10 1 roc_auc binary 0.650 5 0.00127 Preprocessor1_Model5
## 7 17 2000 1 roc_auc binary 0.650 5 0.00127 Preprocessor1_Model6
## 8 17 2000 10 roc_auc binary 0.650 5 0.00127 Preprocessor1_Model8
We find that the mean roc_auc of the best-performing boosted tree model on the folds is 0.650, with mtry = 1, trees = 10, and min_n = 10.
Use select_best() to choose the model that has the optimal roc_auc. Then use finalize_workflow(), fit(), and augment() to fit the model to the training set and evaluate its performance on the testing set.
best_boost_complexity<- select_best(boost_tune_res, metric= 'roc_auc')
boost_tree_final <- finalize_workflow( boost_tree_workflow, best_boost_complexity)
boost_tree_final_fit <- fit(boost_tree_final, data = newrecords_train)
# evaluate its performance on the testing set (estimate the accuracy)
augment(boost_tree_final_fit , new_data = newrecords_test) %>%
accuracy(truth = heart_disease, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.622
augment(boost_tree_final_fit, new_data = newrecords_test) %>%
conf_mat(truth = heart_disease , estimate = .pred_class) %>%
autoplot(type = "heatmap")
# Plot roc_curve
augment(boost_tree_final_fit, new_data = newrecords_test) %>%
roc_curve(heart_disease, .pred_Yes) %>%
autoplot()
# Calculate AUC
augment(boost_tree_final_fit , new_data = newrecords_test) %>%
roc_auc(heart_disease, .pred_Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.643
Lastly, I ran cross-validation on the K nearest neighbors model in the same fashion as the previous models. For nearest neighbors, I tuned only neighbors, as the model's other defaults are fine (as mentioned in lab 07). I also set up the workflow and added the recipe.
knn_model <- nearest_neighbor(neighbors = tune(),mode = "classification") %>%
set_engine("kknn")
knn_workflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(recipe)
# set-up tuning grid
knn_params <- parameters(knn_model)
# define grid
knn_grid <- grid_regular(knn_params, levels = 2)
knn_tune <- knn_workflow %>%
tune_grid(resamples = newrecords_folds,
grid = knn_grid,
metrics = metric_set(roc_auc))
autoplot(knn_tune)
arrange(collect_metrics(knn_tune),desc(mean))
## # A tibble: 2 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 roc_auc binary 0.501 5 0.000367 Preprocessor1_Model2
## 2 1 roc_auc binary 0.5 5 0 Preprocessor1_Model1
We find that the mean roc_auc of the best-performing KNN model on the folds is 0.501, with neighbors = 15, which is barely better than random guessing.
Use select_best() to choose the model that has the optimal roc_auc. Then use finalize_workflow(), fit(), and augment() to fit the model to the training set and evaluate its performance on the testing set.
best_knn_complexity<- select_best(knn_tune, metric= 'roc_auc')
knn_final <- finalize_workflow( knn_workflow, best_knn_complexity)
knn_final_fit <- fit(knn_final, data = newrecords_train)
# evaluate its performance on the testing set (estimate the accuracy)
augment(knn_final_fit , new_data = newrecords_test) %>%
accuracy(truth = heart_disease, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.500
# create and visualize a confusion matrix heat map.
augment(knn_final_fit, new_data = newrecords_test) %>%
conf_mat(truth = heart_disease , estimate = .pred_class) %>%
autoplot(type = "heatmap")
# Plot roc_curve
augment(knn_final_fit, new_data = newrecords_test) %>%
roc_curve(heart_disease, .pred_Yes) %>%
autoplot()
# Calculate AUC
augment(knn_final_fit , new_data = newrecords_test) %>%
roc_auc(heart_disease, .pred_Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.501
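As an aside, rather than calling accuracy() and roc_auc() on separate augment() pipelines as above, yardstick can bundle several metrics into a single callable with metric_set(). A minimal sketch on yardstick's built-in two_class_example data (not our heart disease data):

```r
library(yardstick)

# bundle a class metric and a probability metric into one function
multi_metric <- metric_set(accuracy, roc_auc)

# class metrics use `estimate`; probability metrics take the
# probability column(s) positionally (here, Class1)
multi_metric(two_class_example,
             truth = truth,
             estimate = predicted,
             Class1)
```

Applied to our fitted models, this would return the accuracy and AUC rows in a single tibble.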
# create a matrix that contains the accuracy and AUC
# from Model Logistic regression, Random Forest, Boosted tree, KNN
Model_Name <-c('Logistic regression', 'Random Forest', 'Boosted tree', 'KNN')
Accuracy <- c(0.622,0.622,0.622,0.500)
AUC <- c( 0.646,0.645,0.643,0.501)
information <- cbind(Model_Name,Accuracy,AUC)
# convert it to a tibble
as_tibble(information)
## # A tibble: 4 × 3
## Model_Name Accuracy AUC
## <chr> <chr> <chr>
## 1 Logistic regression 0.622 0.646
## 2 Random Forest 0.622 0.645
## 3 Boosted tree 0.622 0.643
## 4 KNN 0.5 0.501
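Note that cbind() on a mix of character and numeric vectors coerces everything to character, which is why the tibble above prints all three columns as &lt;chr&gt;. A small alternative sketch: building the table directly with tibble() keeps Accuracy and AUC numeric (values taken from the results above):

```r
library(tibble)

# tibble() preserves each column's type, unlike cbind() on mixed types
model_results <- tibble(
  Model_Name = c("Logistic regression", "Random Forest", "Boosted tree", "KNN"),
  Accuracy   = c(0.622, 0.622, 0.622, 0.500),
  AUC        = c(0.646, 0.645, 0.643, 0.501)
)
model_results
```

With numeric columns, the metrics can then be sorted or plotted directly.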
# assign the logistic regression fit (log_test) to final_model
final_model <- log_test
Stroke is the most important variable in our model. In other words, people who have had a stroke are more likely to have heart disease. To show exactly how our model works, let's use two examples to predict whether a person will get heart disease based on the key indicators.
# set up example 1
qualification1 <- data.frame(
bmi = 24,
smoking = 'No',
alcohol_drinking = 'No',
stroke = 'No',
physical_health = 0,
mental_health = 4,
diff_walking = 'No',
sex = 'Male',
age_category = '18-24',
race = 'Asian',
diabetic = 'No',
physical_activity = 'Yes',
gen_health ='Excellent',
sleep_time = 9,
asthma = 'No',
kidney_disease = 'No',
skin_cancer = 'No'
) %>%
mutate(
smoking = factor(smoking, levels = c('Yes', 'No')),
alcohol_drinking = factor(alcohol_drinking, levels = c('Yes', 'No')),
stroke = factor(stroke, levels = c('Yes', 'No')),
diff_walking = factor(diff_walking, levels = c('Yes', 'No')),
sex = factor(sex),
age_category = factor(age_category),
race = factor(race, levels = c("American Indian/Alaskan Native", "Asian", "Black", "Hispanic", "White", "Other")),
diabetic = factor(diabetic, levels = c("No", "No, borderline diabetes", "Yes", "Yes (during pregnancy)")),
physical_activity = factor(physical_activity),
gen_health = factor(gen_health, levels = c("Excellent","Very good", "Good", "Fair", "Poor" )),
asthma = factor(asthma, levels = c('Yes', 'No')),
kidney_disease = factor(kidney_disease, levels = c('Yes', 'No')),
skin_cancer = factor(skin_cancer, levels = c('Yes', 'No')),
)
# do a prediction
predict(final_model, qualification1)
## # A tibble: 1 × 1
## .pred_class
## <fct>
## 1 No
# set up example 2
qualification2 <- data.frame(
bmi = 40,
smoking = 'Yes',
alcohol_drinking = 'Yes',
stroke = 'Yes',
physical_health = 30,
mental_health = 30,
diff_walking = 'Yes',
sex = 'Female',
age_category = '75-79',
race = 'Asian',
diabetic = 'Yes',
physical_activity = 'No',
gen_health ='Poor',
sleep_time = 6,
asthma = 'Yes',
kidney_disease = 'Yes',
skin_cancer = 'Yes'
) %>%
mutate(
smoking = factor(smoking, levels = c('Yes', 'No')),
alcohol_drinking = factor(alcohol_drinking, levels = c('Yes', 'No')),
stroke = factor(stroke, levels = c('Yes', 'No')),
diff_walking = factor(diff_walking, levels = c('Yes', 'No')),
sex = factor(sex),
age_category = factor(age_category),
race = factor(race, levels = c("American Indian/Alaskan Native", "Asian", "Black", "Hispanic", "White", "Other")),
diabetic = factor(diabetic, levels = c("No", "No, borderline diabetes", "Yes", "Yes (during pregnancy)")),
physical_activity = factor(physical_activity),
gen_health = factor(gen_health, levels = c("Excellent","Very good", "Good", "Fair", "Poor" )),
asthma = factor(asthma, levels = c('Yes', 'No')),
kidney_disease = factor(kidney_disease, levels = c('Yes', 'No')),
skin_cancer = factor(skin_cancer, levels = c('Yes', 'No')),
)
# do a prediction
predict(final_model, qualification2)
## # A tibble: 1 × 1
## .pred_class
## <fct>
## 1 Yes
In order to generate a model that predicts whether people will get heart disease based on some key indicators (the 17 predictor variables in our dataset), we used an undersampling technique to deal with survey data from over 300k US residents collected in 2020. We tried several types of models: logistic regression, LDA, QDA, random forest, boosted trees, and KNN.
The following table shows the accuracy and AUC values of our fitted models.
# create a matrix that contains the accuracy and AUC
# from Model Logistic regression, Random Forest, Boosted tree, KNN
Model_Name <-c('Logistic regression', 'Random Forest', 'Boosted tree', 'KNN')
Accuracy <- c(0.622,0.622,0.622,0.500)
AUC <- c( 0.646,0.645,0.643,0.501)
information <- cbind(Model_Name,Accuracy,AUC)
# convert it to a tibble
as_tibble(information)
## # A tibble: 4 × 3
## Model_Name Accuracy AUC
## <chr> <chr> <chr>
## 1 Logistic regression 0.622 0.646
## 2 Random Forest 0.622 0.645
## 3 Boosted tree 0.622 0.643
## 4 KNN 0.5 0.501
Finally, we found that the logistic regression model performed best, with the highest accuracy and AUC, while KNN performed worst, with the lowest accuracy and AUC. Although logistic regression is the best of these models, the accuracy of our final model is only 62.2%, which is not very good.
I think the reason the accuracy is not very high is that we used an undersampling technique during data cleaning. Our original dataset is very imbalanced: 292,422 people do not have heart disease, and only 27,373 people do. We used undersampling to randomly delete 265,035 observations without heart disease; the deleted data accounted for 90.6% of the original observations without heart disease. Discarding that much data inevitably costs the model information, which may explain why the accuracy of our model is only 62.2% and its AUC is only 0.646.
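The undersampling step described above can be sketched as follows. This is a minimal illustration on a small simulated data frame (the object and column names here are illustrative, not the project's actual code); the real data has 292,422 "No" and 27,373 "Yes" observations:

```r
library(dplyr)

# simulated stand-in for the imbalanced dataset: 900 "No", 100 "Yes"
set.seed(123)
records <- tibble::tibble(
  heart_disease = c(rep("No", 900), rep("Yes", 100))
)

no_cases  <- filter(records, heart_disease == "No")
yes_cases <- filter(records, heart_disease == "Yes")

# undersample: randomly keep as many "No" rows as there are "Yes" rows
balanced <- bind_rows(
  slice_sample(no_cases, n = nrow(yes_cases)),
  yes_cases
)
table(balanced$heart_disease)
# both classes now contain 100 rows
```

On the real data, the same idea keeps all 27,373 "Yes" cases and a random subset of the "No" cases of comparable size.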
Overall, in this project we generated a model that predicts whether people will get heart disease based on some key indicators. By using this model, we can remind people to pay attention to their health and lifestyle habits and help them avoid heart disease. I hope we can all stay away from heart disease and live healthy, happy lives.